Spatial interaction models

In today’s lab we will walk through the steps of building a spatial interaction model of commuting flows between London boroughs, using population size and median salary information. The code in this lab is modified from an example given by Dr. Adam Dennett of UCL’s Centre for Advanced Spatial Analysis. You will need a set of R packages to carry out all parts of this lab, so make sure these are installed, then load them:

library(sf)
library(MASS)
library(reshape2)
library(maptools)
library(dplyr)
library(stplanr)
library(ggplot2)

Read in data

First, let’s read in the commuting flow data. This is taken from the 2001 UK Census, downloaded from https://wicid.ukdataservice.ac.uk/.

cdata = read.csv("../datafiles/LondonCommuting2001.csv")
head(cdata)
##             Orig OrigCode                 Dest DestCode Total WorksFromHome
## 1 City of London     00AA       City of London     00AA  2059           432
## 2 City of London     00AA Barking and Dagenham     00AB     6             0
## 3 City of London     00AA               Barnet     00AC    14             0
## 4 City of London     00AA               Bexley     00AD     0             0
## 5 City of London     00AA                Brent     00AE    16             0
## 6 City of London     00AA              Bromley     00AF     0             0
##   Underground Train Bus Taxi CarDrive CarPass Motobike Bicycle Walk Other
## 1         120    53  50   31       39       0        3      18 1272    41
## 2           3     3   0    0        0       0        0       0    0     0
## 3          11     0   0    0        0       0        0       0    3     0
## 4           0     0   0    0        0       0        0       0    0     0
## 5          10     0   3    0        0       0        0       0    3     0
## 6           0     0   0    0        0       0        0       0    0     0
##   OrigCodeNew DestCodeNew OrigPop OrigSal DestPop DestSal
## 1   E09000001   E09000001   12000   38300   12000   38300
## 2   E09000001   E09000002   12000   38300   56000   16200
## 3   E09000001   E09000003   12000   38300  159000   18700
## 4   E09000001   E09000004   12000   38300  112000   18300
## 5   E09000001   E09000005   12000   38300  127000   16500
## 6   E09000001   E09000006   12000   38300  164000   19100

Looking at the first few lines of the file, you will see that each line gives the commuting flow between one origin (Orig) and one destination (Dest). Commuting flows are given as a total flow (Total), and broken down by different modes of transportation. In the last few columns, population and income variables have been added for each origin-destination pair. There are a total of 1089 pairs of commuting flows.

nrow(cdata)
## [1] 1089

Next, read in and plot a shapefile of the London boroughs using the st_read() function:

LondonBNG = st_read("../datafiles/LondonBNG/London.shp")
## Reading layer `London' from data source 
##   `/Users/u0784726/Dropbox/Data/devtools/geog5160/datafiles/LondonBNG/London.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 33 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 503576.3 ymin: 155850.7 xmax: 561958.7 ymax: 200934
## Projected CRS: Transverse_Mercator
plot(st_geometry(LondonBNG))

Before we get to modeling the flow data, there are a couple of pre-processing steps that we need to undertake.

First, calculate a matrix of distances between all pairs of boroughs (this will be the \(d_{ij}\) term in our gravity model). For this we use the function st_distance() on the borough centroids (extracted with st_centroid()):

dist <- st_distance(st_centroid(LondonBNG))
## Warning in st_centroid.sf(LondonBNG): st_centroid assumes attributes are
## constant over geometries of x
dist <- units::drop_units(dist)

dist[1:10,1:10]
##            [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
##  [1,]     0.000 16021.039 13906.457 17310.372 13119.231 18774.436  5725.908
##  [2,] 16021.039     0.000 25075.425  9629.985 27872.037 20102.098 20178.865
##  [3,] 13906.457 25075.425     0.000 29942.769  7543.571 32664.101  8561.550
##  [4,] 17310.372  9629.985 29942.769     0.000 30428.865 11465.489 22852.272
##  [5,] 13119.231 27872.037  7543.571 30428.865     0.000 30386.018  7775.904
##  [6,] 18774.436 20102.098 32664.101 11465.489 30386.018     0.000 24234.649
##  [7,]  5725.908 20178.865  8561.550 22852.272  7775.904 24234.649     0.000
##  [8,] 17749.005 26110.225 30228.613 19571.845 25859.849  9836.842 21806.562
##  [9,] 16595.005 32325.347 13363.359 33488.710  5940.999 31423.634 12331.705
## [10,] 15150.050 19294.077  9347.132 26574.560 16193.111 32501.537 12613.023
##            [,8]      [,9]     [,10]
##  [1,] 17749.005 16595.005 15150.050
##  [2,] 26110.225 32325.347 19294.077
##  [3,] 30228.613 13363.359  9347.132
##  [4,] 19571.845 33488.710 26574.560
##  [5,] 25859.849  5940.999 16193.111
##  [6,]  9836.842 31423.634 32501.537
##  [7,] 21806.562 12331.705 12613.023
##  [8,]     0.000 25161.818 32891.779
##  [9,] 25161.818     0.000 22128.305
## [10,] 32891.779 22128.305     0.000

Next, we need to transform this for use with the commuting flow data. Here, we use the function melt() from the reshape2 package. This is a neat little function that takes a matrix (or other wide-format data) and melts it down into a long format, where the first two columns refer to a row-column pair from the distance matrix, and the third gives the distance between them.

distPair <- melt(dist, varnames = c("Dest","Orig"), value.name = "dist")
head(distPair)
##   Dest Orig     dist
## 1    1    1     0.00
## 2    2    1 16021.04
## 3    3    1 13906.46
## 4    4    1 17310.37
## 5    5    1 13119.23
## 6    6    1 18774.44
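If reshape2 is not available, base R can do the same reshaping: as.data.frame.table() also converts a matrix into a long-format data frame. A minimal sketch, using a small made-up 2 x 2 matrix rather than the full distance matrix:

```r
# Base-R equivalent of reshape2::melt() for a matrix. The 2 x 2 matrix
# here is a toy example, not the London distance matrix.
m <- matrix(c(0, 3, 3, 0), nrow = 2,
            dimnames = list(c("A", "B"), c("A", "B")))
long <- as.data.frame.table(m, responseName = "dist")
names(long)[1:2] <- c("Dest", "Orig")
long
```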

Now add the distance column back into the dataframe, and take a look at the modified data frame:

cdata$dist <- distPair$dist
head(cdata)

Next we will swap the order of some of the columns. This is not important for modeling, but will allow us to visualize the flow data. For visualization, we will use the od2line function from the stplanr package. This takes a set of O/D pairs and generates a spatial network between them, by extracting the coordinates of each pair from a spatial object. It requires that the first two columns of the O/D data frame are the codes of the origin and destination that match the spatial location in the shapefile. There are a couple of ways of doing this, but to make our lives as easy as possible, we’ll use the select() function from the dplyr package:

cdata2 <- dplyr::select(cdata, OrigCodeNew, DestCodeNew, Total, everything())
cdata2$Orig <- as.factor(cdata2$Orig)
cdata2$Dest <- as.factor(cdata2$Dest)

And finally, we remove the intra-borough flows. As the distances generated by st_distance() for these are zero, they can cause problems in the models we generate. They could be included by setting the internal borough travel distance to some positive value, but to make our lives easier, we will just exclude them here and focus on the inter-borough flows.

cdata2 <- cdata2[cdata2$OrigCode!=cdata2$DestCode,]

With that done, we can visualize the flow data using the od2line function from the stplanr package. This needs a set of pairs of locations (cdata2) and a shapefile with the spatial geometry of the individual locations (the shapefile we loaded earlier). With these, it will create a network between locations:

travel_network <- od2line(flow = cdata2, zones = LondonBNG)
## Creating centroids representing desire line start and end points.

To get some sense of the size of the flows, we can use the total flow (Total) to set the line widths of the network links (as a weighting factor w), and then plot it:

w <- cdata2$Total / max(cdata2$Total) * 10
plot(st_geometry(LondonBNG))
plot(st_geometry(travel_network), lwd = w, add=TRUE)

Now we can create a pivot table to turn the paired list into a matrix (and compute the margins as well):

cdata2mat <- dcast(cdata2, Orig ~ Dest, sum, 
                   value.var = "Total", margins=c("Orig", "Dest"))
cdata2mat[1:10,1:10]
##                    Orig Barking and Dagenham Barnet Bexley Brent Bromley Camden
## 1  Barking and Dagenham                    0    194     96   178      66   1500
## 2                Barnet                   96      0     34  5467      76  12080
## 3                Bexley                  362    132      0   144    4998   2470
## 4                 Brent                   40   6124     28     0      66   8105
## 5               Bromley                  134    162   3199   201       0   3780
## 6                Camden                   36   1496     32  1350      60      0
## 7        City of London                    6     14      0    16       0    335
## 8               Croydon                   85    204    300   329    5152   3248
## 9                Ealing                   30    967     33  5263      91   4547
## 10              Enfield                  217   5642     52  1038      76   5588
##    City of London Croydon Ealing
## 1            3641     104    188
## 2            7709     148   1573
## 3            6580     710    188
## 4            4145     187   6703
## 5            9855    6268    293
## 6            8795     147    643
## 7               0       3      9
## 8            5925       0    405
## 9            4855     182      0
## 10           5212     136    626

Modeling

Now let’s produce a gravity model based on this flow data. The equation for the classic unconstrained gravity model is:

\[ T_{ij}=k V_i^\mu W_j^\alpha d_{ij}^\beta \]

It’s perfectly possible to produce some flow estimates by plugging arbitrary or expected values into the parameters. The parameters describe the scaling effect, or importance, of the variables they are associated with.

A starting assumption may be that the effects of origin and destination attributes on flows scale linearly (i.e. a 1 unit increase in the population at an origin results in a 1 unit increase in flows of people from that origin, and halving the average salary at a destination will halve the inflow of commuters). This implies that both \(\mu\) and \(\alpha\) are set to 1. For \(\beta\), we will use a value of -2, suggesting that commuting flow follows an inverse power law with respect to distance.

With these coefficients, we can start to make our flow estimates. To try and keep this as clear as possible, we’ll estimate each term in the gravity model equation individually.

  • First the term \(V_i^\mu\)
mu <- 1
vi1_mu <- cdata2$OrigPop^mu
  • Next the term \(W_j^\alpha\)
alpha <- 1
wj2_alpha <- cdata2$DestSal^alpha
  • Then the term \(d_{ij}^\beta\)
beta <- -2
dist_beta <- cdata2$dist^beta

Finally, we need an estimate of \(k\), the proportionality constant. As the name suggests, this is a constant, and is used to avoid over- or under-estimating the total number of flows. When we have some information about the expected flow data, we can estimate \(k\) based on the total flows in the region and the equation given above:

\[ k = \frac{T}{\sum_i \sum_j V_i^\mu W_j^\alpha d_{ij}^\beta} \]

Where \(T\) is the total number of all commuting flows or trips in the dataset. Effectively, this then scales the amount of flows based on population and salary alone to the observed number of trips, which has the effect of making the predicted value of the total number of trips (\(\sum \hat{T}_{ij}\)) correct. Note that if we don’t know the total flow data, \(k\) is usually taken from previous studies or by trial and error. (While we refer to this model as an unconstrained gravity model, the use of \(\sum T_{ij}\) technically makes this a total constrained model.)

k = sum(cdata2$Total) / sum(vi1_mu * wj2_alpha * dist_beta)
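To see that this scaling works as intended, here is a self-contained check on synthetic values (all the numbers below are made up, not taken from the London data):

```r
# k rescales the raw gravity terms so the predicted total equals the
# observed total T (toy origin/destination values, chosen arbitrarily)
vi   <- c(1000, 2000, 1500)  # origin populations
wj   <- c(300, 400, 350)     # destination salaries
dij  <- c(5, 10, 8)          # distances
Tobs <- 5000                 # observed total flow

raw  <- vi^1 * wj^1 * dij^-2
k    <- Tobs / sum(raw)
That <- k * raw
sum(That)  # equals Tobs by construction
```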

We can now use the gravity model to estimate the flows by simply multiplying all this together, and store this back in the original data frame:

cdata2$unconstrainedEst1 <- round(k * vi1_mu * wj2_alpha * dist_beta, 0)

To make sure that \(k\) is doing its job correctly, calculate the observed total number of flows:

sum(cdata2$Total)
## [1] 1800413

And compare this to the predicted total:

sum(cdata2$unconstrainedEst1)
## [1] 1800411

If you want to see the results as an O/D matrix, we can use the dcast() function to turn the set of inter-borough flows into a matrix.

#turn it into a little matrix and have a look at your handy work
cdata2mat1 <- dcast(cdata2, Orig ~ Dest, sum, 
                    value.var = "unconstrainedEst1", margins=c("Orig", "Dest"))

And if we look at a part of this matrix, we can see the resulting predicted flows:

cdata2mat1[1:10,1:10]
##                    Orig Barking and Dagenham Barnet Bexley Brent Bromley Camden
## 1  Barking and Dagenham                    0    178   1180   127     283    291
## 2                Barnet                  438      0    347  4925     304   4588
## 3                Bexley                 2090    250      0   213    1738    454
## 4                 Brent                  283   4458    268     0     281   4442
## 5               Bromley                  702    307   2439   313       0    591
## 6                Camden                  429   2752    378  2944     351      0
## 7        City of London                   81    124     78   123      69    774
## 8               Croydon                  388    334    781   403    3226    681
## 9                Ealing                  234   1577    246  7041     291   1961
## 10              Enfield                  595   2927    354   860     247   1702
##    City of London Croydon Ealing
## 1             893     161    100
## 2            3364     340   1664
## 3            1529     572    187
## 4            3019     371   6726
## 5            1904    3313    310
## 6           12603     415   1242
## 7               0      74     81
## 8            1987       0    452
## 9            2095     435      0
## 10           2282     231    489

Model goodness-of-fit

For a small set of locations, we can take a look at this matrix and compare it to known flows (the cdata2mat matrix shown above). For larger datasets, this quickly becomes impractical. Instead, let’s calculate the root mean squared error (RMSE) between the observed and predicted flows.

As a reminder, the RMSE is given by the following equation, and is basically an estimate of the average error in the model estimates. The closer to 0 the RMSE value, the better the model.

\[ RMSE = \sqrt{\frac{\sum_i(y_i-\hat{y}_i)^2}{n}} \]

sqrt(mean((cdata2$Total-cdata2$unconstrainedEst1)^2))
## [1] 3179.464
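Since we will compute the RMSE several times below, it can be convenient to wrap it in a small helper function (this is an addition for convenience; the lab’s own code computes it inline):

```r
# Root mean squared error between observed and predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# sanity checks on known values
rmse(c(1, 2, 3), c(1, 2, 3))  # 0: identical vectors
rmse(c(0, 0), c(3, 4))        # sqrt((9 + 16) / 2)
```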

Calibrating the model

The traditional way to improve the model fit has been by gradually changing the set of parameters, and rebuilding the model until the RMSE is as low as possible. However, we can also do this by regression. If we log-transform the gravity model given above, we get the following equation:

\[ \mbox{log} T_{ij}=\mbox{log}(k) + \mu \mbox{log}(V_i) + \alpha \mbox{log} (W_j) + \beta \mbox{log} (d_{ij}) \]

Which you should recognize as a basic linear regression model, where the intercept gives us an estimate of \(\mbox{log}(k)\), and the slope coefficients give us estimates of \(\mu\), \(\alpha\) and \(\beta\), respectively.
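To see why this works, we can generate noise-free flows from the gravity equation with known parameters and check that a log-log regression recovers them exactly (all the values here are synthetic):

```r
# Synthetic gravity flows with known parameters
k_true <- 0.05; mu <- 1; alpha <- 1; beta <- -2
Vi  <- c(1000, 2000, 1500, 1200, 1800, 900)
Wj  <- c(300, 400, 350, 500, 250, 450)
dij <- c(5, 10, 8, 12, 6, 9)
Tij <- k_true * Vi^mu * Wj^alpha * dij^beta

# OLS on the log-transformed equation recovers the parameters
fit <- lm(log(Tij) ~ log(Vi) + log(Wj) + log(dij))
coef(fit)  # intercept = log(k_true); slopes = mu, alpha, beta
```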

To get a sense of what this is trying to do, let’s make a plot of the untransformed flow data against the individual OD distances.

plot(cdata$dist, cdata$Total,
     main="London commuting flow data", xlab="Dist (dij)", ylab="Flow (Tij)")

Not a very linear fit between the flow and distance. If we do the same, but log-transform both variables, we get the following:

plot(log(cdata$dist), log(cdata$Total),
     main="London commuting flow data", 
     xlab="Dist (dij)", ylab="Flow (Tij)")

And while there is a lot of scatter, we can see a linearly declining relationship: fewer people travel over longer distances. If we could estimate the slope of this decline, it would give us an estimate of \(\beta\).

While it is entirely feasible to fit this using a regular linear regression based on ordinary least squares (OLS), this approach has been criticized as unsuitable for modeling flow data. OLS models are designed for use with a continuous, unbounded outcome variable (i.e. something that can potentially take any value from negative to positive infinity). Flow data cannot be negative: at a minimum, a flow is bounded at zero. Also, we are generally dealing with discrete units: a person, a car, etc. A better approach is Poisson regression, a form of generalized linear model (GLM). This models the flows directly as untransformed counts, using the following equation:

\[ T_{ij}=\mbox{exp}( \mbox{log}(k) + \mu \mbox{log}(V_i) + \alpha \mbox{log} (W_j) + \beta \mbox{log} (d_{ij}) ) \]

All this has done is to eliminate the log term on the left hand side and replace it with an exponent on the right hand side. This means that we model the untransformed counts, which by definition are non-negative (bounded at zero) and integer values. This has some further advantages in how the model treats the errors, which make this approach more robust.
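As a quick illustration that this works, we can simulate counts from a Poisson process with a known distance-decay parameter and check that glm() recovers it (synthetic data, with an arbitrarily chosen coefficient):

```r
# Simulate Poisson counts with true distance-decay beta = -1.5
set.seed(42)
n <- 1000
d <- runif(n, 1, 20)
lambda <- exp(5 - 1.5 * log(d))
y <- rpois(n, lambda)

# Fit the Poisson GLM with a log link on the untransformed counts
fit <- glm(y ~ log(d), family = poisson(link = "log"))
coef(fit)["log(d)"]  # close to -1.5
```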

To fit this in R, we use the glm() function, which can be used to fit a GLM. Note that we need to specify that this is a Poisson model with the family = poisson() argument. The syntax to fit this for the London flow data is:

uncosim <- glm(Total ~ log(OrigPop)+log(DestSal)+log(dist), 
               family = poisson(link = "log"), data = cdata2)

Note that we log transform the covariates (population, etc). Now let’s look at the output:

summary(uncosim)
## 
## Call:
## glm(formula = Total ~ log(OrigPop) + log(DestSal) + log(dist), 
##     family = poisson(link = "log"), data = cdata2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -148.329   -26.953   -16.723     1.279   189.270  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -12.522966   0.047412  -264.1   <2e-16 ***
## log(OrigPop)   1.622671   0.003130   518.4   <2e-16 ***
## log(DestSal)   1.549826   0.002979   520.2   <2e-16 ***
## log(dist)     -1.498169   0.001268 -1181.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3374332  on 1055  degrees of freedom
## Residual deviance: 1413323  on 1052  degrees of freedom
## AIC: 1421746
## 
## Number of Fisher Scoring iterations: 5

So what does all this tell us? The most important is the table of coefficients. We get the following estimates:

  • Intercept (\(\mbox{log}(k)\)): -12.523
  • log(OrigPop) (\(\mu\)): 1.623
  • log(DestSal) (\(\alpha\)): 1.55
  • log(dist) (\(\beta\)): -1.498

The final column of this table gives an estimate of the significance of the coefficient (Pr(>|z|)). The lower this is, the more significant the coefficient is in explaining the flow data. Here, all the coefficients are highly significant.

Comparing these to the original simple model above, we see that the first model underestimated the emission (\(\mu\)) and attraction (\(\alpha\)) coefficients. As both are above 1, this suggests that as population and salary increase in an area, commuting flows increase more than proportionally. The magnitude of the distance coefficient (\(\beta\)) was overestimated, but is still above 1, indicating that the per-unit travel cost increases with distance.

As before, we can calculate the RMSE for this model to compare against the simple model. To do this, we use the fitted() function to get the predicted flows for each pair of locations, then round to the nearest integer.

cdata2$unconstrainedEst2 = round(fitted(uncosim), 0)
head(cdata2$unconstrainedEst2)
## [1]  25  39  27  35  26 162

Now reuse the code from above to calculate the RMSE.

## Model 1
sqrt(mean((cdata2$Total-cdata2$unconstrainedEst1)^2))
## [1] 3179.464
## Model 2
sqrt(mean((cdata2$Total-cdata2$unconstrainedEst2)^2))
## [1] 2331.069

Which shows that, for our calibrated model, the RMSE has dropped by approximately 25%.

And calculate the total estimated flow:

sum(cdata2$unconstrainedEst2)
## [1] 1800399

Which matches well to the total observed flow:

sum(cdata2$Total)
## [1] 1800413

This is unsurprising: in a Poisson model with a log link, fitting the intercept guarantees that the predicted total matches the observed total, much as the \(k\) coefficient did in the simple model. Next, convert your estimates back to an O/D matrix and look at them:
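The matching of totals is a general property of a Poisson GLM with a log link and an intercept: the score equation for the intercept forces the sum of fitted values to equal the sum of observed counts. A small synthetic check:

```r
# For any Poisson GLM with an intercept and a log link,
# sum(fitted) equals sum(observed) up to numerical precision
set.seed(1)
y <- rpois(100, 5)
x <- rnorm(100)
fit <- glm(y ~ x, family = poisson(link = "log"))
sum(fitted(fit)) - sum(y)  # effectively zero
```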

cdata2mat2 <- dcast(cdata2, Orig ~ Dest, sum, value.var = "unconstrainedEst2", margins=c("Orig", "Dest"))
cdata2mat2[1:10,1:10]

While the total is well constrained, the margins of the matrix are not. For example, we can compare the estimates of total outflow from each origin by looking at the row totals of the observed O/D matrix, and the new modeled one. If we exclude the extreme values, the predicted totals appear somewhat biased.

totoutflowObs = cdata2mat[,"(all)"]
totoutflowPred = cdata2mat2[,"(all)"]
plot(totoutflowObs, totoutflowPred, log="xy",
     xlab="Observed outflow", ylab="Predicted outflow")
abline(0,1)

And we can do the same for the inflows. To do this, we need to extract the relevant row, which is a little more complex.

totinflowObs = as.numeric(cdata2mat[cdata2mat$Orig=="(all)", -1])
totinflowPred = as.numeric(cdata2mat2[cdata2mat2$Orig=="(all)", -1])
plot(totinflowObs, totinflowPred, log="xy",
     xlab="Observed inflow", ylab="Predicted inflow")
abline(0,1)

Now, let’s recycle the code from above to plot the predicted flows:

w <- cdata2$unconstrainedEst2 / max(cdata2$unconstrainedEst2) * 10
plot(st_geometry(LondonBNG))
plot(st_geometry(travel_network), lwd = w, add=TRUE)

We can also plot the model errors on the same network. To do this, we first calculate the residuals, then translate them into network weights using a somewhat arbitrary weighting. Finally, we plot the absolute values, color coded by whether they are positive (red) or negative (blue) errors.

r <- cdata2$unconstrainedEst2 - cdata2$Total
w <- abs(r) / sum(abs(r)) * 250
err <- ifelse(r > 0, 2, 4)
plot(st_geometry(LondonBNG))
plot(st_geometry(travel_network), lwd = w, add=TRUE, col=err)

Outro

The work here is designed to show that it is relatively easy to build a range of spatial interaction models, from arbitrary hand-picked coefficients, to more complex, constrained models. Placing these in the framework of Poisson regression allows a fairly straightforward way to build the models, but also offers the possibility of extending these model by including more variables that may be relevant to attracting or emitting flows between locations.

It should also be possible to see from this, that the same approach could be applied to more specific flows (the London commuting file contains information about different modes for example), or extended to include different ‘cost’ information to replace the distance variable. Other improvements include modeling intra-regional flows as well as flows between regions, and the use of more complex models (e.g. negative binomial) to deal with some of the limitations of the Poisson model.

Exercise

The CSV file aussie_flow.csv contains 5-year migration data between Australian Greater Capital City Statistical Areas (GCCSA). The file contains the following columns

  • Origin: Origin GCCSA
  • Orig_code: Origin GCCSA code
  • Destination: Destination GCCSA
  • Dest_code: Destination GCCSA code
  • Flow: Migration flow (including intra-GCCSA flow)
  • vi1_origpop: origin population size
  • wj1_destpop: destination population size
  • vi2_origunemp: origin unemployment rate
  • wj2_destunemp: destination unemployment rate
  • vi3_origmedinc: origin median income
  • wj3_destmedinc: destination median income
  • vi4_origpctrent: origin percent rentals
  • wj4_destpctrent: destination percent rentals
  • FlowNoIntra: Migration flow (excluding intra-GCCSA flow)
  • offset: constant offset
  • dist: distance between GCCSA centroids (km)

Using these data, build two unconstrained (not constrained!) gravity models of migration flow, using the origin population as the emission factor and destination income as the attraction factor. As the file also contains intra-zone migration, you will need to remove those before estimating the models, by finding all observations with the same origin and destination:

ausdata <- read.csv("../datafiles/aussie_flow.csv")
ausdata <- ausdata[ausdata$Orig_code != ausdata$Dest_code,]

  1. Model 1:
  • Use \(\mu = 1\), \(\alpha = 1\) and \(\beta = -2\) for this model.
  • Estimate and provide the value of \(k\) using the code below (adapted from the lab to use this file’s column names)
  • Use these coefficients to estimate interzone flows
  • Calculate the RMSE of your estimates
mu <- 1
vi1_mu <- ausdata$vi1_origpop^mu
alpha <- 1
wj2_alpha <- ausdata$wj3_destmedinc^alpha
beta <- -2
dist_beta <- ausdata$dist^beta
k <- sum(ausdata$Flow) / sum(vi1_mu * wj2_alpha * dist_beta)
  2. Model 2:
  • Use the glm function to fit a Poisson model of the flow data as a function of origin population and destination income
  • Provide the output of the summary of the model
  • List the model coefficients, give the matching model coefficient (\(\mu\), \(\alpha\), \(\beta\) and \(k\))
  • Estimate the predicted flows
  • Calculate the RMSE
  • Is \(\beta\) above or below one? Why do you think it has that value?

Appendix: Fitting constrained models

THIS IS AN OPTIONAL PART OF THE LAB

In the previous step, we estimated an unconstrained model. While we were able to calibrate this using the full set of flows (as these were available), the model could also be fit to partial, incomplete O/D flow information. As we have the full O/D matrix, we can improve the model by using this extra information to constrain it.

  • If we use the row (origin) totals, we get a production constrained model, e.g. if we have information on the amount of disposable income and shopping habits of the people living in different areas from loyalty card data

  • If we use the column (destination) totals, we get an attraction constrained model, e.g. if we have information on new infrastructure developments such as a new airport, and wish to understand the impact on transportation flows

  • If we use both row and column totals, we get the doubly constrained model

Production constrained model

This model is given by

\[ T_{ij} = A_i O_i W_j^\alpha d_{ij}^\beta \]

Where \(O_i\) represents the total outflow from zone \(i\) \[ O_i=\sum_j T_{ij} \]

And \(A_i\) is a balancing factor to account for different levels of productivity. To model this, we just rewrite the equation of the Poisson model to allow estimates of this factor:

\[ T_{ij}=\mbox{exp}( k + \mu_i + \alpha \mbox{log} (W_j) + \beta \mbox{log} (d_{ij}) ) \]

Here the coefficient \(\mu_i\) is a vector of balancing factors, one for each zone. In the model, this is just a set of dummy variables, which we represent here using the borough name. In R, the syntax for this model is:

prodSim <- glm(Total ~ Orig + log(DestSal) + log(dist), 
               family = poisson(link = "log"), data = cdata2)
summary(prodSim)

The main difference between this and the previous model is that we no longer have a single estimate of \(\mu\), but instead a set of coefficients, one per origin. You might notice that the first location is missing (the borough of Barking and Dagenham). This is used as the reference location, and the remaining coefficients indicate how important the other locations are as origins for commuters.
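The constraining effect of the origin dummies can be seen on a toy example: in a Poisson GLM, each origin factor’s score equation forces the fitted total for that origin to equal its observed total (synthetic 3 x 3 flow data, arbitrary values):

```r
# Toy production-constrained model: 3 origins x 3 destinations
set.seed(2)
toy <- expand.grid(orig = factor(c("A", "B", "C")),
                   dest = factor(c("X", "Y", "Z")))
toy$dist <- runif(9, 1, 10)
toy$flow <- rpois(9, 50)

fit <- glm(flow ~ orig + log(dist),
           family = poisson(link = "log"), data = toy)

# fitted origin (row) totals match the observed ones exactly
cbind(obs  = tapply(toy$flow, toy$orig, sum),
      pred = tapply(fitted(fit), toy$orig, sum))
```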

We might wish to make a less arbitrary choice of reference location. For London, the City of London (shown on the map below) can be assumed to be the main destination for commuters.

plot(st_geometry(LondonBNG))
plot(st_geometry(LondonBNG[LondonBNG$lad15nm=="City of London",]), add=TRUE, col=2)

To set this as the reference level in R, use the following code:

cdata2$Orig2 = relevel(cdata2$Orig, ref = "City of London")

And let’s use this to build a second model with the City of London as reference:

prodSim2 <- glm(Total ~ Orig2 + log(DestSal) + log(dist), 
               family = poisson(link = "log"), data = cdata2)
summary(prodSim2)

Now all these coefficients (\(\mu_i\)) are relative to the City of London. Let’s map them out to see the spatial pattern. Here, we use the coef() function to extract them, and add them to the shapefile. Note that we need to remove the first and the last two coefficients (the intercept and the \(\alpha\) and \(\beta\) slopes).

modcoef = coef(prodSim2)[-c(1, 34, 35)]
LondonBNG$prodcoef = c(0, modcoef)

And then plot it, to show how commuting outflows increase as we move out from the center towards the suburbs.

plot(LondonBNG["prodcoef"])

Now let’s explore the predicted flows. First, get the predicted flow for each O/D pair from the first production-constrained model:

cdata2$prodsimFitted <- round(fitted(prodSim), 0)

Then create an O/D matrix, with the marginal sums of total outflow (rows) and inflows (columns):

cdata2mat3 <- dcast(cdata2, Orig ~ Dest, sum, value.var = "prodsimFitted", 
                    margins=c("Orig", "Dest"))
cdata2mat3

If we now redo the comparison of the outflow values, we can see that the constraint has worked and the outflows match.

totoutflowObs = cdata2mat[, "(all)"]
totoutflowPred = cdata2mat3[, "(all)"]
plot(totoutflowObs, totoutflowPred, log="xy",
     xlab = "Observed outflow", ylab = "Predicted outflow")
abline(0, 1)

Including the constraint has also improved the model RMSE, showing some benefit of incorporating this extra information in the model.

sqrt(mean((cdata2$Total - cdata2$prodsimFitted)^2))
## [1] 2204.673

Attraction constrained model

The attraction constrained model uses information about the total inflow to each region to constrain the model:

\[ T_{ij} = D_j B_j V_i^\mu d_{ij}^\beta \]

Where \(D_j\) represents the total inflow to zone \(j\) \[ D_j=\sum_i T_{ij} \]

And \(B_j\) is a balancing factor to account for different levels of attraction. We make a similar modification to the Poisson model to accommodate this:

\[ T_{ij}=\mbox{exp}( k + \mu \mbox{log} (V_i) + \alpha_j + \beta \mbox{log} (d_{ij}) ) \]

Now we have a vector of dummy variables (\(\alpha_j\)), one per zone, representing each zone’s level of attractiveness for commuting. In R, we simply include the destinations rather than the origins. Again, we’ll set the City of London as the reference destination:

cdata2$Dest2 = relevel(cdata2$Dest, ref = "City of London")
attrSim <- glm(Total ~ Dest2 + log(OrigPop) + log(dist), 
               family = poisson(link = "log"), data = cdata2)
summary(attrSim)
coef(attrSim)

The output here again contains a single coefficient per region (minus the reference region), but here each indicates how much more (or less) likely that region is to be a destination for commuters.

modcoef = coef(attrSim)[-c(1,34,35)]
LondonBNG$attrcoef = c(0, modcoef)

And then plot it, to show the spatial pattern of destination attractiveness.

plot(LondonBNG["attrcoef"])

Now let’s get the flow for each O/D pair, and compare to the observations:

cdata2$attrsimFitted <- round(fitted(attrSim),0)

Create an O/D matrix, with the marginal sums of total outflow (rows) and inflows (columns):

cdata2mat4 <- dcast(cdata2, Orig ~ Dest, sum, value.var = "attrsimFitted", 
                    margins=c("Orig", "Dest"))
cdata2mat4

Now redo the comparison of the inflow values:

totinflowObs = as.numeric(cdata2mat[cdata2mat$Orig=="(all)", -1])
totinflowPred = as.numeric(cdata2mat4[cdata2mat4$Orig=="(all)", -1])
plot(totinflowObs, totinflowPred, log = "xy",
     xlab = "Observed inflow", ylab = "Predicted inflow")
abline(0,1)

And calculate the RMSE, which shows that including the destination constraint has a substantial improvement on the model:

sqrt(mean((cdata2$Total-cdata2$attrsimFitted)^2))
## [1] 1681.193

Doubly constrained model

Of course, the next obvious step is to combine these two constraints in the doubly constrained model:

\[ T_{ij} = A_i O_i D_j B_j d_{ij}^\beta \]

Estimating this by hand is somewhat difficult, as the value of \(A_i\) depends on \(B_j\) and vice versa, so it requires a set of iterations to converge on the best estimates. In our case, however, there is no need to do this: we can just run a Poisson regression with the full set of factors (origins and destinations).
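That no iteration is needed can be checked on a small synthetic example: with both origin and destination factors in the Poisson model, both sets of margins are matched at once (toy 3 x 3 flows, arbitrary values):

```r
# Toy doubly constrained model: both margins are reproduced exactly
set.seed(3)
toy <- expand.grid(orig = factor(c("A", "B", "C")),
                   dest = factor(c("X", "Y", "Z")))
toy$dist <- runif(9, 1, 10)
toy$flow <- rpois(9, 50)

fit <- glm(flow ~ orig + dest + log(dist),
           family = poisson(link = "log"), data = toy)

# both row (origin) and column (destination) totals match
max(abs(tapply(toy$flow, toy$orig, sum) - tapply(fitted(fit), toy$orig, sum)))
max(abs(tapply(toy$flow, toy$dest, sum) - tapply(fitted(fit), toy$dest, sum)))
```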

doubSim <- glm(Total ~ Orig + Dest + log(dist), 
               family = poisson(link = "log"), data = cdata2)
#let's have a look at its summary...
summary(doubSim)

As before, get the estimated flow for each O/D pair:

cdata2$doubsimFitted <- round(fitted(doubSim),0)

Then create an O/D matrix, with the marginal sums of total outflow (rows) and inflows (columns):

cdata2mat5 <- dcast(cdata2, Orig ~ Dest, sum, value.var = "doubsimFitted", 
                    margins=c("Orig", "Dest"))
cdata2mat5[1:10,1:10]

The double constraint results in a marked improvement in the model fit, and a drop in the RMSE.

sqrt(mean((cdata2$Total-cdata2$doubsimFitted)^2))
## [1] 1136.545